An important paradigm of natural language processing consists of large-scale pre-
training on general domain data and adaptation to particular tasks or domains. As
we pre-train larger models, full fine-tuning, which retrains all model parameters,
becomes less feasible. Using GPT-3 175B as an example – deploying indepen-
dent instances of fine-tuned models, each with 175B parameters, is prohibitively
expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-
trained model weights and injects trainable rank decomposition matrices into each
layer of the Transformer architecture, greatly reducing the number of trainable pa-
rameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam,
LoRA can reduce the number of trainable parameters by 10,000 times and the
GPU memory requirement by 3 times. LoRA performs on-par or better than fine-
tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite hav-
ing fewer trainable parameters, a higher training throughput, and, unlike adapters,
no additional inference latency. We also provide an empirical investigation into
rank-deficiency in language model adaptation, which sheds light on the efficacy of
LoRA. We release a package that facilitates the integration of LoRA with PyTorch
models and provide our implementations and model checkpoints for RoBERTa,
DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA
